Abstract

The mark-recapture method was devised by Petersen in 1896 to estimate the number 
of fish migrating into the Limfjord, and independently by Lincoln in 1930 to estimate 
waterfowl abundance. The technique can be applied to any search for a finite number 
of items by two or more people or agents, allowing the number of searched-for items to 
be estimated. This ubiquitous problem appears in fields from ecology and epidemiology, 
through to mathematics, social sciences, and computing. Here we exactly calculate the 
moments of the hypergeometric distribution associated with this long-standing problem, 
confirming that widely used estimates conjectured in 1951 are often too small. Our Bayesian 
approach highlights how different search strategies will modify the estimates. The estimates 
are applied to several examples. For some published applications substantial errors are found 
to result from using the Chapman or Lincoln-Petersen estimates. 
1 Introduction



If a finite set is searched by two or more people it is possible to estimate how many of the 
searched-for items have been missed. The simple Lincoln-Petersen estimate was independently 
developed by Petersen (1896) to estimate fish numbers migrating between the German sea and 
the Limfjord, and by Lincoln (1930) to estimate waterfowl abundance. The technique has rapidly 
grown in popularity since a more rigorous treatment by Chapman (1951), especially in the context 
of ecological census techniques (Seber 1982, Sutherland 2006) and epidemiology (Hook and 
Regal 1995). Our interest arose from the technique's application to assess the accuracy of a 
literature search. In 1938 such a literature search led to the re-discovery of Alexander Fleming's 
papers on penicillin (Masters 1946, Lax 2004), and penicillin's subsequent development. Today 
literature searches are a valued method for identifying and appraising evidence, particularly in 
evidence-based healthcare (Sackett et al. 1996). Reviews often search thousands of papers, 
and standardised guidelines have developed for reporting search terms and the databases used 
(Liberati et al. 2009, Higgins & Green 2011). Common practice involves an electronic search to 
retrieve hundreds or even thousands of potentially relevant articles, that are subsequently searched 
by the authors for pertinent material. Inevitably, even if multiple authors search the database, 
human error may cause some papers to be erroneously missed at this stage, leading to a less 
comprehensive review (Edwards et al. 2002). The Lincoln-Petersen estimator has previously been 
used to assess the completeness of medical databases (Spoor et al. 1996, Bennett et al. 2004, 
Poorolajal et al. 2010), and to provide "stopping rules" to help determine when searches are 
complete (Kastner et al. 2009, Booth 2010); surprisingly, standard practice does not include an 
estimate for the number of papers unintentionally omitted by a search. 

Here we derive some simple but rigorous results for estimating the number of items missed 
from a search, including exact expressions for the average, standard deviation, and skewness. 
They correct a widely used conjecture from Chapman's 1951 paper and a subsequent widely used 
approximation for the variance. Despite their extensive use (Seber 1982, Hook and Regal 1995, 
Sutherland 2006), we confirm the suggestion (García-Pelayo 2006) that previous conjectured and 
approximated estimates can be inaccurate for many cases of interest, including assessing the 
accuracy of literature searches. 

Figure 1: The total number of papers found ($N_f$) equals the number found by A ($N_A$), plus the 
number found by B ($N_B$), minus the number found by both A and B ($N_{AB}$), which would 
otherwise be counted twice. 

The problem is as follows. Authors A and B each separately search a given set of references 
for relevant articles. (It is assumed that after agreement by both authors, papers that are included 
are definitely relevant.) The result is that $N_A$ and $N_B$ articles are found by authors A and B 
respectively, with $N_{AB}$ of those found by both authors. If we assume all papers are equally likely 
to be found, then a simple estimate can be made as follows. Taking $N$ as the total number 
of papers searched for, and taking probabilities $p_A$, $p_B$, and $p_{AB}$ for A, B, and both (A and B) 
finding $N_A$, $N_B$, and $N_{AB}$ papers respectively, then we can estimate $p_A$, $p_B$, and $p_{AB}$ from 

$$ p_A \simeq \frac{N_A}{N}, \qquad p_B \simeq \frac{N_B}{N}, \qquad p_{AB} \simeq \frac{N_{AB}}{N} \tag{1} $$

Because the probability $p_{AB}$ of a paper being found by both authors is $p_{AB} = p_A \times p_B$, we can 
combine and solve (1) for $N$, giving an estimate for $N$ as 

$$ N \simeq \frac{N_A N_B}{N_{AB}} \tag{2} $$

The number of papers missed, $X$, is then estimated to be $X = N - N_f$, where $N_f = N_A + N_B - N_{AB}$ 
is the total number of different papers found by the two authors (Figure 1), finding after 
a little algebra, 

$$ X \simeq \frac{(N_A - N_{AB})(N_B - N_{AB})}{N_{AB}} \tag{3} $$

Equations (2) and (3) are often reasonable estimates if the numbers involved are large. However, 
these estimates are clearly misleading if $N_{AB}$ equals $N_A$ or $N_B$, or is zero: for the former cases 
because there can be papers that both authors have missed (although the estimate suggests 
not); and for the latter case because an infinite estimate is inconsistent with searching a finite 
set. More importantly, there is no indication of the accuracy of the estimate, so used in isolation 
it is impossible to know whether it is reasonable or not. Improved estimates are given later by 
(19), (20), (23), and (24): the need for them and their derivation is explained in the following 
sections. The key assumption underlying all of these estimates is that all items are equally likely 
to be found. As is discussed at the end of Section 3, when this assumption is true or a reasonable 
approximation, then the estimates can be used. 
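
As a quick illustration of (2) and (3), the following Python sketch evaluates the simple Lincoln-Petersen estimates for a pair of searches. The function name and the example counts are hypothetical, chosen only for illustration; the guard on $N_{AB} = 0$ reflects the infinite estimate discussed above.

```python
def lincoln_petersen(n_a, n_b, n_ab):
    """Simple estimates from equations (2) and (3).

    n_a, n_b: items found by A and by B; n_ab: items found by both.
    Returns (estimated total N, estimated number missed X).
    """
    if n_ab == 0:
        # Equation (2) would be infinite, which is inconsistent with a finite set.
        raise ValueError("N_AB = 0: the simple estimate is undefined")
    n_total = n_a * n_b / n_ab
    n_missed = (n_a - n_ab) * (n_b - n_ab) / n_ab
    return n_total, n_missed


# Hypothetical counts, for illustration only.
n_a, n_b, n_ab = 20, 15, 10
n_found = n_a + n_b - n_ab          # N_f, the number of distinct items found
n_total, n_missed = lincoln_petersen(n_a, n_b, n_ab)
print(n_found, n_total, n_missed)
```

The two returned values differ by exactly $N_f$, as expected from $X = N - N_f$; neither carries any indication of its own accuracy, which is the gap the later sections address.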

The paper proceeds as follows. Section 2 uses a Bayesian approach to allow a rigorous 
mathematical derivation of the probability density function for the number of items missed. Section 3 
considers the calculation of its moments. "Exact estimates" refer to exactly calculated moments 
of the distribution. "Approximate estimates" refer to approximations for the moments, usually 
found by expanding about the distribution's maximum. Consequently approximated averages are 
often close to the "most probable" estimate, where the distribution is a maximum. Section 4 
comments on the effects of different assumptions on the final answer, and finds explicit prior 
assumptions for which Chapman's estimate is exactly the most probable estimate. The main 
result of this paper is to show that the moments can be calculated exactly, subsequently finding 
that Chapman's extensively used estimate can sometimes be misleading. A recently published 
example discussed in Section 3 emphasises this. 

Throughout the paper we refer to two search procedures. In the example above, both authors 
searched for all the papers ($N$) and compared the number found by both ($N_{AB}$) to estimate 
$N \simeq N_A N_B/N_{AB}$. An alternative approach is for A and B to search for a predetermined number 
of items $N_A$ and $N_B$ respectively, stopping when that number is found, and again using the 
number $N_{AB}$ found by both to estimate $N \simeq N_A N_B/N_{AB}$. Whereas the former approach is 
more sensible for a literature search, the latter approach allows a comparatively small sample 
of animals to provide an estimate for their abundance. Mathematically the difference can be 
important. If a fixed number of items $N_A$ are searched for, then other than the requirement that 
$N_A \le N$, $N_A$ is independent of $N$. In contrast, if all items are searched for then the probability 
of A finding $N_A$ items is dependent on $N$. Equivalent remarks apply to B. Section 2 uses Bayes' 
theorem to rigorously formulate the problem for both search procedures. Section 3 notes that 
provided that a large number of items are found, then the moments of both problems are closely 
related, and the moments of one can be used to closely approximate the moments of the other. 
The consequences of different search procedures are discussed further in Section 4. Section 5 
summarises the paper's conclusions. 

2 Bayesian formulation 

The shortcomings with (2) and (3) arise from the estimates of $p_A \simeq N_A/N$, $p_B \simeq N_B/N$, and 
$p_{AB} \simeq N_{AB}/N$. They improve with increasing values of $N_A$, $N_B$, and $N_{AB}$, but are nonetheless 
estimates. Specifically, if we knew the probability $p_A$ of author A finding any given paper (we 
continue to assume all papers are equally difficult to find), and if we also knew the total number 
of papers $N$ that the author is searching for, then the probability of author A finding $N_A$ papers 
is given by the binomial distribution, 

$$ P(N_A|N,p_A) = \frac{N!}{N_A!\,(N-N_A)!}\, p_A^{N_A} (1-p_A)^{N-N_A} \tag{4} $$

The expected number of papers to be found is then $\langle N_A \rangle = \sum_{N_A=0}^{N} N_A P(N_A|N,p_A) = p_A N$ 
(e.g. Stirzaker (1994)). Therefore provided $N_A \simeq \langle N_A \rangle$, as on average it will be, then the 
estimates (1) will be reasonable. However, for small numbers in particular they can give misleading 
results. 

Bayes' theorem (Bayes & Price 1763) was first used for mark and recapture estimates by 
Gaskell & George (1972), and allows a rigorous derivation that avoids these shortcomings. In 
its modern form Bayes' theorem states that $P(X|Y)P(Y) = P(Y|X)P(X)$ (Sivia 2005), and 
allows us to write, 

$$ P(N|N_A,N_B,N_{AB}) = \frac{P(N_A,N_B,N_{AB}|N)\,P(N)}{P(N_A,N_B,N_{AB})} \tag{5} $$

Repeatedly using $P(X,Y) = P(X|Y)P(Y)$ (Sivia 2005), and conditional independence of $N_A$ 
($N_A \le N$), $N_B$ ($N_B \le N$), given $N$, this expands to give, 

$$ P(N|N_A,N_B,N_{AB}) = \frac{P(N_{AB}|N_A,N_B,N)\,P(N_A|N)\,P(N_B|N)\,P(N)}{P(N_A,N_B,N_{AB})} \tag{6} $$

Equation (6) gives the probability of there being $N$ papers to find, given that author A has found 
$N_A$ papers, author B has found $N_B$ papers, and $N_{AB}$ of the papers were found by both authors. 
$P(N)$ is the (prior) probability of there being $N$ papers to be found given no information about 
the numbers of papers A and B will find, $P(N_A|N)$ is the probability of finding $N_A$ papers given 
that there are $N$ papers to be found, and equivalently for $P(N_B|N)$. $P(N_{AB}|N_A,N_B,N)$ is the 
probability of $N_{AB}$ papers being found by both authors, given that there are $N$ papers to find, 
and that authors A and B find $N_A$ and $N_B$ papers respectively. 

2.1 Searches for every item 

Firstly consider $P(N_A|N)$, and assume that all $N$ items are searched for. Given no prior 
knowledge of how effective author A may be at finding papers, we take $P(N_A|N)$ to be functionally 
independent of $N_A$. Correct normalisation requires that $\sum_{N_A=0}^{N} P(N_A|N) = 1$, giving 
$P(N_A|N) = 1/(N+1)$, and similarly for $P(N_B|N)$. Equivalently, assume $p_A$ and $N$ are 
independent, and take $P(N_A|N,p_A)$ as given by (4). Then use marginalisation (Sivia 2005) to 
write $P(N_A|N) = \int_0^1 P(N_A|N,p_A)P(p_A)\,dp_A$, assume a uniform prior for $P(p_A)$, and integrate 
to find the same answer. This latter approach suggests how the method can be generalised if 
we relax the assumption that all items are equally likely to be found, through modified forms for 
$P(N_A|N,p_A)$ and $P(p_A)$. 

$P(N_{AB}|N_A,N_B,N)$ is the probability of there being $N_{AB}$ items found 
by both A and B, given only the information that A found $N_A$ items, B found $N_B$ items, and that 
there are $N$ items to find. This can be calculated by using a metaphor of selecting balls from 
an urn filled with white balls. The first author picks $N_A$ balls at random, paints them yellow, 
and returns them. The second author picks $N_B$ balls, and $N_{AB}$ is the number of yellow balls the 
second author has picked. This is a well-known problem (e.g. Stirzaker (1994, p. 174)), whose 
solution is the hypergeometric distribution, 

$$ P(N_{AB}|N_A,N_B,N) = \binom{N_A}{N_{AB}} \binom{N-N_A}{N_B-N_{AB}} \Big/ \binom{N}{N_B} \tag{7} $$

with $N_{AB} \le N_A \le N$ and $N_{AB} \le N_B \le N$. 
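
The urn argument can be checked numerically. The sketch below, with hypothetical function and variable names and arbitrary counts, evaluates the hypergeometric probability using Python's `math.comb`. It confirms that the probabilities sum to one over the allowed overlaps and that the mean overlap is $N_A N_B/N$, which is the relation behind the simple estimate (2).

```python
from math import comb


def p_overlap(n_ab, n_a, n_b, n):
    """Hypergeometric probability that B's n_b picks include n_ab of A's n_a items."""
    if n_ab < 0 or n_ab > min(n_a, n_b) or n_b - n_ab > n - n_a:
        return 0.0  # impossible overlap
    return comb(n_a, n_ab) * comb(n - n_a, n_b - n_ab) / comb(n, n_b)


# Hypothetical example: N = 50 items, A finds 20, B finds 15.
n, n_a, n_b = 50, 20, 15
probs = [p_overlap(k, n_a, n_b, n) for k in range(min(n_a, n_b) + 1)]
total = sum(probs)
mean_overlap = sum(k * p for k, p in enumerate(probs))
print(total, mean_overlap)  # the mean should be close to n_a * n_b / n
```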

Combining the above and (7) with $P(N_A|N) = P(N_B|N) = 1/(N+1)$ we get, 

$$ P(N|N_A,N_B,N_{AB}) = C\, \frac{(N-N_A)!\,(N-N_B)!}{N!\,(N-N_f)!}\, \frac{P(N)}{(N+1)^2} \tag{8} $$

where $C$ is functionally dependent on $N_A$, $N_B$, and $N_{AB}$, but not $N$, and is most easily found by 
ensuring that $P(N|N_A,N_B,N_{AB})$ is normalised to 1 after summing over $N$ from the total number 
of different papers found, $N_f = N_A + N_B - N_{AB}$, to $\infty$. This Bayesian approach was used by 
Zucchini & Channing (1986) to derive a similar result, but without the factors of $P(N_A|N)$ and 
$P(N_B|N)$ that lead to some differences discussed later. Note that because the sum is over $N$ not 
$N_{AB}$, the moments are different to those usually associated with the hypergeometric distribution 
that involve sums over $N_{AB}$. 
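
Because the posterior is only known up to a normalising constant, in practice it can be normalised numerically. The following is a minimal sketch, assuming a constant prior $P(N)$, uniform $P(N_A|N) = P(N_B|N) = 1/(N+1)$, and a hypothetical cut-off `n_max` at which the infinite sum is truncated. The function names and counts are illustrative, and log-gamma functions are used to avoid overflow in the factorials.

```python
from math import exp, lgamma, log


def log_weight(n, n_a, n_b, n_ab):
    """Log of the N-dependent part of the posterior for a constant prior P(N)."""
    n_f = n_a + n_b - n_ab
    return (lgamma(n - n_a + 1) + lgamma(n - n_b + 1)
            - lgamma(n + 1) - lgamma(n - n_f + 1)
            - 2.0 * log(n + 1))


def posterior(n_a, n_b, n_ab, n_max=20000):
    """Normalised P(N | N_A, N_B, N_AB) for N = N_f .. n_max (truncated sum)."""
    n_f = n_a + n_b - n_ab
    values = list(range(n_f, n_max + 1))
    logs = [log_weight(n, n_a, n_b, n_ab) for n in values]
    shift = max(logs)  # subtract the maximum before exponentiating
    weights = [exp(v - shift) for v in logs]
    norm = sum(weights)
    return values, [w / norm for w in weights]


# Hypothetical counts, for illustration only.
values, probs = posterior(n_a=10, n_b=12, n_ab=8)
mean_n = sum(n * p for n, p in zip(values, probs))
print(values[0], mean_n)
```

The truncation is only safe when the tail decays quickly, which here requires $N_{AB}$ to be reasonably large; for small $N_{AB}$ the choice of `n_max` (or of a proper prior) would need more care.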

2.2 Searching for a predetermined number of items 

If authors A and B search for a fixed number of say 10 items each, so that $N_A$ and $N_B$ are now 
specified in advance, then the previous derivation is modified slightly. As before, $N_{AB} \le N_A \le N$ 
and $N_{AB} \le N_B \le N$, but $N$ and $N_{AB}$ can otherwise be assumed independent of $N_A$ and 
$N_B$. If $I$ is some prior information, such as the number of items $N_A$ to be searched for by A 
and the number of items $N_B$ to be searched for by B, then Bayes' theorem gives (Sivia 2005) 
$P(X|Y,I) = P(Y|X,I)P(X|I)/P(Y|I)$. Substituting $N$ for $X$, $N_{AB}$ for $Y$, and $N_A$, $N_B$ for $I$, 
Bayes' theorem gives, 

$$ P(N|N_{AB},N_A,N_B) = \frac{P(N_{AB}|N,N_A,N_B)\,P(N|N_A,N_B)}{P(N_{AB}|N_A,N_B)} \tag{9} $$

If we make the prior assumption that all values of $N$ (greater than or equal to the largest of $N_A$ 
and $N_B$) are equally likely, then $P(N|N_A,N_B)$ will not depend on $N$. This is an "improper", 
i.e. un-normalisable, prior. Strictly $P(N|N_A,N_B)$ should be zero for $N$ bigger than the largest 
conceivable number of items in the set being searched. With this assumption the factor of 
$P(N|N_A,N_B)$ is replaced with a constant term, leaving, 

$$ P(N|N_A,N_B,N_{AB}) = K\, \frac{(N-N_A)!\,(N-N_B)!}{N!\,(N-N_f)!} \tag{10} $$

where, as for $C$ in (8), $K$ is functionally dependent on $N_A$, $N_B$, $N_{AB}$, and is most easily found 
by ensuring that (10) is correctly normalised. This is the equation whose approximated moments 
have been extensively used (Seber 1982, Sutherland 2006, Hook and Regal 1995) and studied 
(Chapman 1951, Zucchini & Channing 1986, Seber 1970, Wittes 1972, García-Pelayo 2006), and 
whose moments we will calculate exactly shortly. 



3 Results 

Given a suitable choice for $P(N)$ or $P(N|N_A,N_B)$ respectively, (8) and (10) provide the full 
solution to the problem, allowing numerical values for the average and standard deviation to be 
calculated by summing from $N = N_f$ to $N = \infty$ for different moments of $N$. The following 
section takes the prior $P(N|N_A,N_B)$ as being constant, then calculates the moments of (10) 
exactly. It also gives an (often excellent) approximation for the moments of (8) when the prior 
$P(N)$ is constant, and suggests a prior for which the calculated moments are exact. Throughout 
we will use the statistical physics notation of angled brackets, with e.g. $\langle f(N) \rangle$, to denote the 
expected value of some function $f(N)$, obtained by averaging over the probability density function 
for $N$. Firstly we will calculate moments of the extensively studied (10), and compare these exactly 
calculated moments with existing approximations. Then we will consider the moments of (8), 
and use these in some applications. 
3.1 The moments of (10) 

To calculate the moments we first rewrite (10) in terms of $X = N - N_f$, $X_A = N_A - N_{AB}$, and 
$X_B = N_B - N_{AB}$, so that $N_f = N_{AB} + X_A + X_B$, and, 

$$ P(X|X_A,X_B,N_{AB}) = K\, \frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f)!} \tag{11} $$

This gives a probability distribution for the number of papers $X$ that have not been found, with 
$X$ between 0 and $\infty$. The moments of (11) are calculated next using a generating function 
approach. Appendix A contains an alternative (our original) calculation for the moments that 
is less systematic, but uses simpler mathematical concepts and avoids the use of generating 
functions. The moments of (11) can be written, 

$$ \langle X^p \rangle = \frac{\left. \left( z \frac{d}{dz} \right)^p \sum_{X=0}^{\infty} \frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f)!}\, z^X \right|_{z=1}}{\left. \sum_{X=0}^{\infty} \frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f)!}\, z^X \right|_{z=1}} \tag{12} $$

where the operator $(z\,d/dz)^p f(z)|_{z=1}$ represents applying $z \times d/dz$ to $f(z)$ $p$ times, and then 
evaluating the result at $z = 1$. The denominator of (12) is simply $1/K$. Equation (12) differs 
slightly from conventional moment generating functions (Stirzaker 1994), in that the factor 
of $z$ before $d/dz$ ensures that repeated application of $(z\,d/dz)$ yields the moments, not the 
"factorial moments" (Stirzaker 1994) that would be obtained by repeatedly applying $(d/dz)$. 
The hypergeometric function is defined for $|z| < 1$ by (Arfken 1985), 

$$ {}_2F_1(a+1, b+1, c+1, z) = \frac{c!}{a!\,b!} \sum_{n=0}^{\infty} \frac{(n+a)!\,(n+b)!}{n!\,(n+c)!}\, z^n \tag{13} $$

provided $c \neq 0, -1, -2, \ldots$. It also has an integral representation (Arfken 1985), 

$$ {}_2F_1(a+1, b+1, c+1, z) = \frac{c!}{b!\,(c-b-1)!} \int_0^1 t^{b} (1-t)^{c-b-1} (1-tz)^{-a-1}\, dt \tag{14} $$

that is valid for $|z| < 1$ and $z = 1$ provided $\mathrm{Re}(c+1) > \mathrm{Re}(b+1) > 0$. This standard result (14) 
is not obviously symmetric with respect to $a$ and $b$ as would be expected from (13), however the 
expected symmetry is recovered later in (19) and (20) when the calculation is complete. As a 
consequence of (13), (12) can be written as, 



$$ \langle X^p \rangle = \frac{\left. \left( z \frac{d}{dz} \right)^p {}_2F_1(X_A+1, X_B+1, N_f+1, z) \right|_{z=1}}{\left. {}_2F_1(X_A+1, X_B+1, N_f+1, z) \right|_{z=1}} \tag{15} $$

with the requirements of $\mathrm{Re}(N_f+1) > \mathrm{Re}(X_B+1) > 0$ clearly satisfied. Equation (15) is 
easily evaluated. Firstly use (14) to substitute for ${}_2F_1(X_A+1, X_B+1, N_f+1, z)$, then take 
derivatives, and set $z = 1$. The resulting integral can be evaluated using the beta function's 
identity (Arfken 1985), 

$$ \int_0^1 t^{b} (1-t)^{c-a-b-2}\, dt = \frac{b!\,(c-a-b-2)!}{(c-a-1)!} \tag{16} $$

that holds provided $\mathrm{Re}(c+1) > \mathrm{Re}(a+1) + \mathrm{Re}(b+1)$ and $\mathrm{Re}(b+1) > 0$, a requirement that 
will restrict the values of $N_{AB}$ for which the resulting formulae can be used. Taking the derivatives 
is relatively straightforward because for $t \in (0,1)$ and $|z| < 1$, $(1-tz)^{-a-1}$ is continuous with respect to 
both $t$ and $z$, and we can bring the derivative with respect to $z$ inside the integral. Then noting 
that, 



$$ z \frac{d}{dz} \left( \frac{1}{(1-tz)^{a+1}} \right) = \frac{a+1}{(1-tz)^{a+2}} - \frac{a+1}{(1-tz)^{a+1}} \tag{17} $$

and applying $z\,d/dz$ to (14) $p$ times, we get 

$$ \left. \left( z \frac{d}{dz} \right)^{p} {}_2F_1(a+1, b+1, c+1, z) \right|_{z=1} = (a+1) \left. \left( z \frac{d}{dz} \right)^{p-1} \Big[ {}_2F_1(a+2, b+1, c+1, z) - {}_2F_1(a+1, b+1, c+1, z) \Big] \right|_{z=1} \tag{18} $$

where the use of (17) can be seen by setting $p = 1$. Equation (18) can be iterated until the right 
hand side is a function of ${}_2F_1(a, b, c, 1)$, for various $a$'s, $b$'s, and $c$'s, and can be evaluated using 
(16). For $\langle X \rangle$ this gives the average number of items missed as, 

$$ \langle X \rangle = \frac{(N_A - N_{AB} + 1)(N_B - N_{AB} + 1)}{(N_{AB} - 2)} \quad \text{with } N_{AB} > 2 \tag{19} $$

where $X_A$, $X_B$, and $N_f$ have been written in terms of $N_A$, $N_B$, and $N_{AB}$, and $N_{AB} > 2$ arises 
from the requirement on $a$, $b$, and $c$ that allows (16) to be used. Similarly the standard deviation 
$\sigma$ is found from, 

$$ \sigma^2 = \frac{(N_A - N_{AB} + 1)(N_B - N_{AB} + 1)(N_A - 1)(N_B - 1)}{(N_{AB} - 2)^2 (N_{AB} - 3)} \quad \text{with } N_{AB} > 3 \tag{20} $$

Higher moments are also easily calculated and expressions for the skewness and kurtosis are given 
in the appendix. Equations (19) and (20) are exact under the assumption, used in (10), that the prior 
$P(N|N_A,N_B)$ does not depend on $N$. The constraint on the minimum value of $N_{AB}$ 
for which the expressions hold is a mathematical requirement, and appears to be a requirement 
for the series to converge. As discussed later, this requirement on $N_{AB}$ can be overcome with a 
suitably convergent prior distribution $P(N)$. Because both $N_A$ and $N_B$ are greater than or equal 
to $N_{AB}$, then $N_{AB} > 2$ will require $N_A > 2$ and $N_B > 2$ also. 
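
As a numerical sanity check on the closed forms for the mean and variance, the sketch below sums the distribution for $X$ directly and compares the result with $(X_A+1)(X_B+1)/(N_{AB}-2)$ and the corresponding variance. It is illustrative only: the function names, the example counts, and the truncation point `x_max` of the infinite sum are assumptions, and the check is only meaningful for $N_{AB} > 3$, where both expressions are defined.

```python
from math import exp, lgamma


def exact_moments(n_a, n_b, n_ab):
    """Closed-form mean and variance of the number of missed items X."""
    if n_ab <= 3:
        raise ValueError("the closed forms for mean and variance require N_AB > 3")
    x_a, x_b = n_a - n_ab, n_b - n_ab
    mean = (x_a + 1) * (x_b + 1) / (n_ab - 2)
    var = ((x_a + 1) * (x_b + 1) * (n_a - 1) * (n_b - 1)
           / ((n_ab - 2) ** 2 * (n_ab - 3)))
    return mean, var


def summed_moments(n_a, n_b, n_ab, x_max=20000):
    """Mean and variance of X from a truncated direct sum over the distribution."""
    x_a, x_b = n_a - n_ab, n_b - n_ab
    n_f = n_a + n_b - n_ab
    # log of (X + X_A)! (X + X_B)! / (X! (X + N_f)!) for X = 0 .. x_max
    logs = [lgamma(x + x_a + 1) + lgamma(x + x_b + 1)
            - lgamma(x + 1) - lgamma(x + n_f + 1) for x in range(x_max + 1)]
    shift = max(logs)
    w = [exp(v - shift) for v in logs]
    norm = sum(w)
    m1 = sum(x * wx for x, wx in enumerate(w)) / norm
    m2 = sum(x * x * wx for x, wx in enumerate(w)) / norm
    return m1, m2 - m1 * m1


print(exact_moments(10, 12, 8))    # closed forms
print(summed_moments(10, 12, 8))   # truncated numerical sum
```

For counts with small $N_{AB}$ the tail of the distribution decays slowly, so the truncated sum converges poorly; this is the numerical counterpart of the constraint on $N_{AB}$ noted above.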



3.2 Comparison with Chapman's estimate 

Previous approaches have approximated these same average and standard deviation by a 
combination of conjecture and estimations for the precision and bias (Chapman 1951, Seber 1970, Wittes 
1972, Seber 1982). It has been observed (García-Pelayo 2006) that previous (approximate) 
estimates can be inaccurate for combinations of $N_A$, $N_B$, and $N_{AB}$ that cause the hypergeometric 
distribution to have a 'long tail', for example if $N_A \gg N_{AB}$. These remarks can now be clarified. 
Chapman's (1951) estimation gives $\langle N \rangle \simeq \frac{(N_A+1)(N_B+1)}{(N_{AB}+1)} - 1$, and $\langle X \rangle = \langle N \rangle - N_f$, as, 

$$ \langle X \rangle \simeq \frac{(N_A - N_{AB})(N_B - N_{AB})}{(N_{AB} + 1)} \tag{21} $$

Comparing this with (19) (for example by subtracting (21) from (19)), we can see that: 

1. it is always less than (19), 

2. that this is more pronounced when either or both of $(N_A - N_{AB})$ or $(N_B - N_{AB})$ are large, 
or when $N_{AB}$ is small, but that conversely, 

3. provided neither $N_A$ nor $N_B$ equals $N_{AB}$, it will give the same (unbiased) estimate if $N_{AB}$ 
is sufficiently large compared with both $(N_A - N_{AB})$ and $(N_B - N_{AB})$. 

Similar remarks apply to the widely used estimate for the variance (Seber 1970), that has 

$$ \sigma^2 \simeq \frac{(N_A + 1)(N_B + 1)(N_A - N_{AB})(N_B - N_{AB})}{(N_{AB} + 1)^2 (N_{AB} + 2)} \tag{22} $$

and is unbiased for $N_{AB} \gg 1$, but accuracy requires an increasingly large $N_{AB}$ if either $(N_A - 
N_{AB})$ or $(N_B - N_{AB})$ are small, and in practice it can be inaccurate. 

Seber (1970, 1982) has remarked that Chapman's calculations are equivalent to approximating 
(10) with a Poisson distribution. Appendix B finds this requires both $(N_A - N_{AB})/N_{AB} \ll 1$ 
and $(N_B - N_{AB})/N_{AB} \ll 1$ (and implicitly that $N_{AB} \gg 1$). When this is true, the 
mean of the approximating Poisson distribution coincides with the maximum of (11) with $\langle X \rangle = 
(N_A - N_{AB})(N_B - N_{AB})/N_f$, and approximates both (10) and (8) (for this limit). Similarly for 
the variance. In contrast (19) and (20) result from exactly calculating the moments of (11). As 
noted in Appendix B, this Poisson approximation generalises to the situation studied by 
García-Pelayo (2006), in which there are $n$ searches instead of only two. 
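
The first of the numbered observations above can be confirmed directly: Chapman's estimate for the number missed, $(N_A - N_{AB})(N_B - N_{AB})/(N_{AB}+1)$, has a smaller numerator and a larger denominator than the exact mean $(N_A - N_{AB}+1)(N_B - N_{AB}+1)/(N_{AB}-2)$. The sketch below, using hypothetical function names and an arbitrary grid of counts, checks the inequality and prints the two estimates for a few values of $N_{AB}$.

```python
def exact_mean(n_a, n_b, n_ab):
    """Exact mean number of missed items; requires N_AB > 2."""
    return (n_a - n_ab + 1) * (n_b - n_ab + 1) / (n_ab - 2)


def chapman_missed(n_a, n_b, n_ab):
    """Chapman's estimate for the number of missed items."""
    return (n_a - n_ab) * (n_b - n_ab) / (n_ab + 1)


# Arbitrary grid of counts with N_AB > 2 (illustrative only).
always_smaller = all(
    chapman_missed(n_a, n_b, n_ab) < exact_mean(n_a, n_b, n_ab)
    for n_ab in range(3, 30)
    for n_a in range(n_ab, n_ab + 40)
    for n_b in range(n_ab, n_ab + 40)
)
print(always_smaller)

# The absolute gap narrows as N_AB grows with N_A - N_AB and N_B - N_AB held fixed.
for n_ab in (3, 10, 100, 1000):
    n_a, n_b = n_ab + 20, n_ab + 30
    print(n_ab, chapman_missed(n_a, n_b, n_ab), exact_mean(n_a, n_b, n_ab))
```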

3.3 The moments of (8) 

When all items are searched for by both A and B, the probability distribution for the number of 
items searched for is given by (5). For the common choice of prior with $P(N)$ constant, Appendix 
C shows how the moments of (8) can be closely approximated using the moments of (10), and 
calculates rigorous maximum bounds for the error in the approximation. When $N_f \gg 1$ the error 
will be small and a good approximation is given by, 

$$ \langle X \rangle \simeq \frac{(N_A - N_{AB} + 1)(N_B - N_{AB} + 1)}{N_{AB}} \tag{23} $$

with an error that is less than $\pm \langle X \rangle/(N_f + 1)$. Unfortunately $\sigma^2 = \langle X^2 \rangle - \langle X \rangle^2$ can be arbitrarily 
small, but the approximation for $\sigma^2$ of, 

$$ \sigma^2 \simeq \frac{(N_A - N_{AB} + 1)(N_B - N_{AB} + 1)(N_A + 1)(N_B + 1)}{N_{AB}^2 (N_{AB} - 1)} \tag{24} $$

has a maximum error that is of order $\langle X^2 \rangle/N_f$. Consequently unless $\langle X^2 \rangle/N_f \ll 1$, (24) is not 
guaranteed to be a good approximation for $\sigma^2$. Often there will be a prior reason to expect that 
$N \gg 1$. For these cases an alternative approach is to assume the almost constant prior of, 

$$ P(N) = k\, \frac{(N+1)}{(N+2)} \tag{25} $$

with $k$ constant, that may be written as $P(N) = k(1 - 1/(N+2))$, and monotonically increases 
from $P(0) = k/2$ to $P(\infty) = k$. This prior gives a small bias against low values of $N$ but is 
approximately constant for larger values of $N$. For example, $P(N)$ varies by less than ten percent 
between $N = 8$ and $N = \infty$. For this prior (8) becomes, 

$$ P(N|N_A,N_B,N_{AB}) \propto \frac{(N-N_A)!\,(N-N_B)!}{(N+2)!\,(N-N_f)!} \tag{26} $$

Remembering that $N_f = N_A + N_B - N_{AB}$, then rewriting in terms of $(N+2)$, $(N_A+2)$, 
$(N_B+2)$, and $(N_{AB}+2)$, it will be clear that the change of variables that replaces $(N+2)$ with 
$N$, $(N_A+2)$ with $N_A$, $(N_B+2)$ with $N_B$, $(N_{AB}+2)$ with $N_{AB}$, makes (26) the same form as 
(10). The condition that $N = N_f$ may be written as $(N+2) = (N_A+2) + (N_B+2) - (N_{AB}+2)$, 
so after the change of variables the lower limit $N = N_f$ on sums for the moments remains the 
same. The upper limit of $N = \infty$ is clearly also unchanged. Consequently the exact moments 
of (26) can be found by replacing $N_A$ with $N_A+2$, $N_B$ with $N_B+2$, and $N_{AB}$ with $N_{AB}+2$, 
in the exactly calculated moments of (10), with for example (19) and (20) becoming (23) and 
(24). (An alternative presentation of these remarks can be found in Appendix C.) With the prior 
(25), (23) and (24) are exact moments of (5), and the error bounds now provide a bound on the 
maximum possible difference between estimates calculated with this, and with a flat prior. For 
those cases when it is reasonable to assume this prior, we think it is preferable to explicitly use 
it along with the exact estimates (23) and (24), in preference to assuming a constant prior and 
treating (23) and (24) as approximations. 

Both (23) and (24) are more similar to the Chapman and Lincoln-Petersen estimates than 
(19) and (20). This is despite them being approximations to the moments of (8), not (10), that 
Chapman's calculation is intended to approximate. This might help explain why the discrepancy 
between Chapman's estimate and (19) is generally overlooked. For many cases of interest the 
number of items found ($N_f$) is large, with $N_f \gg 1$, and for these cases (23) provides an accurate 
estimate for $\langle X \rangle$. Next we consider some examples. 
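
The change of variables can be written down mechanically. In the sketch below (hypothetical function names, arbitrary example counts), the exact mean and variance for the predetermined-number search are evaluated with $N_A$, $N_B$, and $N_{AB}$ each increased by 2, and the result is compared with the closed forms $(N_A - N_{AB}+1)(N_B - N_{AB}+1)/N_{AB}$ and its companion variance.

```python
def moments_fixed_search(n_a, n_b, n_ab):
    """Exact mean and variance when N_A and N_B are predetermined; needs N_AB > 3."""
    x_a, x_b = n_a - n_ab, n_b - n_ab
    mean = (x_a + 1) * (x_b + 1) / (n_ab - 2)
    var = ((x_a + 1) * (x_b + 1) * (n_a - 1) * (n_b - 1)
           / ((n_ab - 2) ** 2 * (n_ab - 3)))
    return mean, var


def moments_full_search(n_a, n_b, n_ab):
    """Mean and variance when all items are searched for; needs N_AB > 1."""
    x_a, x_b = n_a - n_ab, n_b - n_ab
    mean = (x_a + 1) * (x_b + 1) / n_ab
    var = ((x_a + 1) * (x_b + 1) * (n_a + 1) * (n_b + 1)
           / (n_ab ** 2 * (n_ab - 1)))
    return mean, var


n_a, n_b, n_ab = 10, 12, 8   # arbitrary illustrative counts
shifted = moments_fixed_search(n_a + 2, n_b + 2, n_ab + 2)
direct = moments_full_search(n_a, n_b, n_ab)
print(shifted, direct)
```

The differences $N_A - N_{AB}$ and $N_B - N_{AB}$ are unchanged by the shift, which is why only the denominators and the factors $(N_A - 1)(N_B - 1)$ are affected.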
3.4 Examples 

When A and B each search for a number of items that is predetermined in advance of their search, 
then (19) and (20) provide simple estimates for the maximum number of items that could be found 
by a search for all items, and the precision of the estimate. They are exact moments of (10). When 
all items are searched for, provided the number of items found ($N_f$) is much greater than one, 
then a very good estimate can be made using (23), and if the prior $P(N) = k(N+1)/(N+2)$ 
is assumed then (23) and (24) are exact moments of (8). Both pairs of estimates can give 
substantially different estimates to those of Chapman (21) and Lincoln-Petersen (3). For example, 
Chao et al. (2008) propose a method to combine multiple intersections of lists and the 
Lincoln-Petersen or Chapman estimator, with the intention of improving the accuracy of epidemiological 
estimates. The number of items in common between lists is not predetermined, and is anywhere 
between zero and every item on the shortest list. Their proposed method is illustrated in Section 
4 of Chao et al. (2008), and the estimates calculated by the method are given on the top of 
page 968, where they are calculated from the numbers in their Table 5b using the Chapman 
and also the Lincoln-Petersen estimate. The results of their calculations are reported in Table 6 
on page 968 of their paper, and repeated in part in Table 1. The total number of items ($N_f$) 
is much larger than one in all cases, and consequently an accurate estimate is given by (23). 
An immediate concern is that the Chapman and Lincoln-Petersen estimates are estimators for 
the moments of (10), that arise from a search procedure for a predetermined number of items, 
and should not be used. It is a fortunate coincidence that the moments of (8) are closer to 
the Chapman and Lincoln-Petersen estimates than are the exact moments of (10) that they are 
intended to approximate. They are also estimates for the most probable population size, and not 
the expectation of the population size, which can be much larger. For the cases in Table 5b of 
Chao et al. (2008) where (23) and (24) are defined, we find the revised estimates given in Table 
1. Also included are the estimates from Table 6 of Chao et al. (2008), and Seber's estimate for 
the variance. Our estimates are substantially different, and in some cases $N_{AB}$ is too small to 
allow them to be used. It is unusual, but not unreasonable, to find distribution functions without 
a well-defined mean or standard deviation. Without a suitable prior distribution the female list 
for the "shared population" of Chao et al. (2008) will fall into this category. For such cases it is 
necessary to (explicitly) use a suitable prior if estimates are to be correctly made. 

            $N_A$   $N_B$   $N_{AB}$   $\langle N \rangle$   $\sigma$    $\langle N \rangle_C$   $\langle N \rangle_{LP}$   $\sigma_S$
Male        323     101     3          11014                 7638        8261                    10874                      3599
Female      21      19      1          438                   undefined   219                     399                        115
Combined    344     120     4          10434                 5890        8348                    10320                      3067

Table 1: Estimates for $\langle N \rangle = N_f + \langle X \rangle$ and $\sigma$ are calculated using (23), (24), and the numbers 
in Table 5b of Chao et al. (2008), that are reproduced above as $N_A$, $N_B$, and $N_{AB}$. The 
estimates from Table 6 of Chao et al. (2008), that use the Chapman ($\langle N \rangle_C$) and 
Lincoln-Petersen ($\langle N \rangle_{LP}$) estimates for $\langle N \rangle$, and Seber's estimate for the variance ($\sigma_S$), are also 
included. Our estimates, where they are defined, are substantially different to the quoted 
estimates (Chao et al. 2008) that use the Lincoln-Petersen (3) and Chapman (21) estimates. 
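
Several columns of Table 1 can be reproduced from the three counts alone. The sketch below (function and variable names are hypothetical) evaluates $\langle N \rangle = N_f + \langle X \rangle$ with $\langle X \rangle = (N_A - N_{AB}+1)(N_B - N_{AB}+1)/N_{AB}$, the matching $\sigma$, and the Chapman and Lincoln-Petersen values; `None` marks the case where the variance is undefined because $N_{AB} = 1$. Seber's column is not recomputed here.

```python
from math import sqrt


def table_row(n_a, n_b, n_ab):
    """Return (<N>, sigma, <N>_C, <N>_LP) for one pair of lists."""
    x_a, x_b = n_a - n_ab, n_b - n_ab
    n_f = n_a + n_b - n_ab                       # distinct items found
    n_mean = n_f + (x_a + 1) * (x_b + 1) / n_ab
    if n_ab > 1:
        sigma = sqrt((x_a + 1) * (x_b + 1) * (n_a + 1) * (n_b + 1)
                     / (n_ab ** 2 * (n_ab - 1)))
    else:
        sigma = None                             # variance undefined for N_AB = 1
    n_chapman = n_f + x_a * x_b / (n_ab + 1)
    n_lp = n_f + x_a * x_b / n_ab
    return n_mean, sigma, n_chapman, n_lp


counts = {"Male": (323, 101, 3), "Female": (21, 19, 1), "Combined": (344, 120, 4)}
rows = {label: table_row(*c) for label, c in counts.items()}
for label, row in rows.items():
    print(label, row)
```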

Smaller deviations from the usual Lincoln-Petersen and Chapman estimates are expected when 
$N_{AB}$ is sufficiently large compared to $N_A$ and $N_B$. For example, in a recent review by May et al. 
(2011), there were 177 relevant papers found by author A, 265 papers found by author B, and 
171 of these papers found by both authors (K.E. May, private communication). Using (23) and 
(24), we find $\langle X \rangle \simeq 3.9$ and $\sigma \simeq 2.5$. Therefore whereas 271 papers were found, our estimate 
gives between 1 and 6 missed papers. Putting it another way, the estimate is that between 
97.6% and 99.5% of the papers searched for from within the total sample of just over 8 thousand 
papers were found. The standard estimates (Chapman 1951, Seber 1970) give $\langle X \rangle = 3.3$ and 
$\sigma = 2.3$, and are somewhat smaller despite the reasonably large value of $N_{AB} = 171$. Another 
literature search example (Spoor et al. 1996) found $N_A = 150$, $N_B = 123$, and $N_{AB} = 115$, for 
which (23) and (24) give $\langle X \rangle = 2.8$ and $\sigma = 2.0$. These compare with the standard estimates 
(Chapman 1951, Seber 1970) that give $\langle X \rangle = 2.4$ and $\sigma = 1.8$. 
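
The two literature-search examples can be checked with a few lines of arithmetic. The sketch below (hypothetical function names) evaluates the mean and standard deviation of the number missed for the full-search case, alongside the Chapman and Seber estimates, for the counts quoted above.

```python
from math import sqrt


def ours(n_a, n_b, n_ab):
    """(<X>, sigma) when all items are searched for; requires N_AB > 1."""
    x_a, x_b = n_a - n_ab, n_b - n_ab
    mean = (x_a + 1) * (x_b + 1) / n_ab
    var = (x_a + 1) * (x_b + 1) * (n_a + 1) * (n_b + 1) / (n_ab ** 2 * (n_ab - 1))
    return mean, sqrt(var)


def standard(n_a, n_b, n_ab):
    """(<X>, sigma) from the Chapman mean and Seber variance estimates."""
    x_a, x_b = n_a - n_ab, n_b - n_ab
    mean = x_a * x_b / (n_ab + 1)
    var = (n_a + 1) * (n_b + 1) * x_a * x_b / ((n_ab + 1) ** 2 * (n_ab + 2))
    return mean, sqrt(var)


examples = {"May et al. (2011)": (177, 265, 171), "Spoor et al. (1996)": (150, 123, 115)}
for label, counts in examples.items():
    x, s = ours(*counts)
    x_std, s_std = standard(*counts)
    print(f"{label}: ours {x:.1f} +/- {s:.1f}, standard {x_std:.1f} +/- {s_std:.1f}")
```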
3.5 Limitations of the model 

Underlying the calculation is the assumption that all items are equally likely to be found. Clearly 
there will be cases where some items are more difficult to find. However even in those cases, 
some (lower bound) estimate for the number of items missed is better than no estimate at all. 
The method will fail most dramatically if there is a sub-population that is much more difficult 
to find; it is possible that both searchers could miss all or most of that sub-population, and so 
overestimate the accuracy of their search. These limitations should be considered before applying 
these estimates, and when reporting them. If there is a (prior) reason to think the assumptions 
are inappropriate, one way that modified assumptions can be included is through different priors 
for $P(N_A|N)$ and $P(N_B|N)$, as was discussed in Section 2.1. In general this will give distribution 
functions that are most easily calculated numerically. 

4 Bayesian corrections and other search procedures 

An advantage of the Bayesian approach is that the assumptions are explicit at the outset and the 
resulting answers are exact, with no additional free parameters. Before concluding we consider 
two easily evaluated examples that illustrate how different prior assumptions and different search 
procedures affect the estimates. 

4.1 One partial and one comprehensive search 

Firstly imagine a situation where one author (e.g. A) searches for a fixed number of papers so 
that $P(N_A|N)$ no longer appears in (5), but the other author (B) searches for as many papers as 
possible with $P(N_B|N) = 1/(N+1)$, with no prior knowledge of the number of papers searched 
for other than it being finite ($P(N)$ constant). For this case (10) is modified by the factor $1/N!$ 
becoming $1/(N+1)!$. In Section 2.3 it was explained how a suitable change of variables could 
transform (25) into the same form as (10), allowing the moments of (25) to be calculated from 
those of (10) by a simple change of variables. The same is true here: the change of variables that 
replaces $(N+1)$ with $N$, $(N_A+1)$ with $N_A$, $(N_B+1)$ with $N_B$, and $(N_{AB}+1)$ with $N_{AB}$, leads 
to the same form of $P(N|N_A,N_B,N_{AB})$ as (10). Similarly to Section 2.3, because the equation 
$N = N_f = N_A + N_B - N_{AB}$ may be written as $(N+1) = (N_A+1) + (N_B+1) - (N_{AB}+1)$, 
the lower limit on the range of summation for the moments remains unchanged by the change 
of variables, as does the $N = \infty$ upper limit. Consequently the exact moments can be found by 
replacing $N_A$ by $N_A+1$, $N_B$ by $N_B+1$, and $N_{AB}$ by $N_{AB}+1$ in (19) and (20), giving, 

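\[ \langle X \rangle = \frac{(N_A - N_{AB} + 1)(N_B - N_{AB} + 1)}{N_{AB} - 1}, \quad \text{with } N_{AB} > 1 \tag{27} \]

and, 

\[ \sigma^2 = \frac{(N_A - N_{AB} + 1)(N_B - N_{AB} + 1)\, N_A N_B}{(N_{AB} - 1)^2 (N_{AB} - 2)}, \quad \text{with } N_{AB} > 2 \tag{28} \]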

Interestingly, for this search procedure the standard capture-recapture estimate conjectured 
by Chapman of $\langle N \rangle \simeq \frac{(N_A+1)(N_B+1)}{(N_{AB}+1)} - 1$ approximates the "most probable" value of $N$, 
where $P(N|N_A,N_B,N_{AB})$ is a maximum. The maximum can be approximated by setting 
$P(N|N_A,N_B,N_{AB}) = P(N-1|N_A,N_B,N_{AB})$ and solving for $N$ (Chapman 1951, García-Pelayo 
2006). For the stated prior assumptions this gives, 

\[ \frac{(N - N_A)!\,(N - N_B)!}{(N+1)!\,(N - N_A - N_B + N_{AB})!} = \frac{(N - N_A - 1)!\,(N - N_B - 1)!}{N!\,(N - N_A - N_B + N_{AB} - 1)!} \tag{29} \]

whose solution for $N$ is exactly Chapman's conjectured estimate. (Strictly this estimate is only an 
approximation to the most probable value of $N$: a more precise value can be found using Stirling's 
approximation for the factorials and differentiating with respect to $N$ to find the maximum of 
$P(N|N_A,N_B,N_{AB})$.) 
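
The claim that Chapman's estimate solves (29) can be checked directly. The following is a minimal sketch, not part of the original analysis: the function names and the example counts (30, 20 and 10) are illustrative assumptions. It cancels the common factorials in (29), confirms that Chapman's value makes the two sides equal, and locates the integer maximum of the unnormalised distribution by brute force.

```python
import math
from fractions import Fraction


def chapman_estimate(n_a, n_b, n_ab):
    """Chapman's estimate of N: (N_A + 1)(N_B + 1)/(N_AB + 1) - 1."""
    return Fraction((n_a + 1) * (n_b + 1), n_ab + 1) - 1


def mode_residual(n, n_a, n_b, n_ab):
    """Equation (29) after cancelling common factorials:
    (N - N_A)(N - N_B) - (N + 1)(N - N_A - N_B + N_AB), zero at the solution."""
    return (n - n_a) * (n - n_b) - (n + 1) * (n - n_a - n_b + n_ab)


def log_posterior(n, n_a, n_b, n_ab):
    """Unnormalised log P(N|N_A,N_B,N_AB) for Section 4.1,
    i.e. (10) with 1/N! replaced by 1/(N+1)!."""
    return (math.lgamma(n - n_a + 1) + math.lgamma(n - n_b + 1)
            - math.lgamma(n + 2) - math.lgamma(n - n_a - n_b + n_ab + 1))


def most_probable_n(n_a, n_b, n_ab, search_width=10_000):
    """Brute-force integer maximum over N >= N_f (search_width is an assumption)."""
    n_f = n_a + n_b - n_ab
    candidates = range(n_f, n_f + search_width)
    return max(candidates, key=lambda n: log_posterior(n, n_a, n_b, n_ab))


if __name__ == "__main__":
    n_star = chapman_estimate(30, 20, 10)
    print(float(n_star), mode_residual(n_star, 30, 20, 10), most_probable_n(30, 20, 10))
```

For these illustrative counts the integer maximum lies at the integer part of Chapman's value, which is consistent with the remark that the estimate only approximates the most probable $N$.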

4.2 The influence of a proper prior 

To illustrate the effect of $P(N)$, consider the normalisable prior $P(N) = k(N+1)/[(N+2)(N+3)(N+4)] \sim k/N^2$, 
with $k$ constant, and let both A and B search for as many items as possible 
with $P(N_A|N) = P(N_B|N) = 1/(N+1)$. For this example (10) is modified by $1/N!$ becoming 
$1/(N+4)!$. Following a similar change of variables as discussed above and in Section 2.3, but 
now with $(N+4)$ replaced by $N$, $(N_A+4)$ with $N_A$, $(N_B+4)$ with $N_B$, and $(N_{AB}+4)$ with $N_{AB}$, 
then $P(N|N_A,N_B,N_{AB})$ becomes the same form as in (10). Consequently modified estimates 
can be found by substituting $N_A$ with $N_A+4$, $N_B$ with $N_B+4$, and $N_{AB}$ with $N_{AB}+4$, in (19) 
and (20), leading to a reduced estimate for $\langle X \rangle$. 

Notice that for this latter example the requirement that $N_{AB} > 3$ in (20) becomes (with 
$N_{AB}$ replaced by $N_{AB}+4$) $N_{AB} > -1$, and the estimates hold for all $N_A$, $N_B$, and $N_{AB}$. 
The conclusion is that whereas (19) and (20) can only be used when $N_{AB}$, $N_A$, and $N_B$ are 
sufficiently large ($> 3$), when all items are searched for (resulting in the extra factor of $1/(N+1)^2$ 
in $P(N|N_A,N_B,N_{AB})$), the equations apply for a greater range of values. In fact unless $N_{AB}$ 
is sufficiently large, then estimates can only be calculated with a sufficiently convergent (i.e. 
realistic) prior for a given search strategy (such as searching for a fixed number of items, or for 
all the items). In summary, it is important to ensure that the assumptions upon which any given 
estimate depends are consistent with the problem being studied. 
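
The substitutions used in Sections 4.1 and 4.2 can be sketched as a single helper. This is an illustrative sketch only: the function name and the `shift` parameter are assumptions introduced here, and (19) and (20) are taken in the form obtained from (42) and (46) of the appendix. A shift of 0 gives (19) and (20), a shift of 1 gives (27) and (28), and a shift of 4 gives the proper-prior example above.

```python
from fractions import Fraction


def shifted_estimate(n_a, n_b, n_ab, shift=0):
    """Mean and variance of the number of missed items X.

    Applies N_A -> N_A + shift, N_B -> N_B + shift, N_AB -> N_AB + shift
    to (19) and (20). `shift` is a hypothetical convenience parameter.
    """
    a, b, ab = n_a + shift, n_b + shift, n_ab + shift
    if ab <= 3:
        raise ValueError("the variance formula needs N_AB + shift > 3")
    found_only_by_a = a - ab + 1   # N_A - N_AB + 1, unchanged by the shift
    found_only_by_b = b - ab + 1   # N_B - N_AB + 1, unchanged by the shift
    mean = Fraction(found_only_by_a * found_only_by_b, ab - 2)
    variance = Fraction(found_only_by_a * found_only_by_b * (a - 1) * (b - 1),
                        (ab - 2) ** 2 * (ab - 3))
    return mean, variance


if __name__ == "__main__":
    for shift in (0, 1, 4):
        mean, variance = shifted_estimate(30, 20, 10, shift)
        print(shift, float(mean), float(variance) ** 0.5)
```

Because only the denominators change with the shift, the mean falls as the shift grows, in line with the reduced estimate for $\langle X \rangle$ noted above.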

5 Conclusions 

The original purpose of this calculation was to consider two authors A and B searching a finite 
set of papers for those to include in a literature survey, and to use the number of papers found by 
authors A ($N_A$) and B ($N_B$), along with the number found by both authors ($N_{AB}$), to estimate 
how accurate the search was. Bayes' theorem is used to rigorously formulate this "mark-recapture" 
problem for two different search procedures. The first procedure corresponds to A and B searching 
for all of the items, the second corresponds to A and B each searching for a predetermined number 
of items, before comparing their results to allow an estimate for $N$. For the latter case, exact 
calculations lead to simple formulae for the average number of items missed from the search (19), 
and the standard deviation (20). The skewness and kurtosis of the probability distribution are 
given in the appendix and higher moments may be calculated in a similar way. 

Equations (19) and (20) are exact moments of the widely-studied probability distribution (10) 
from Chapman's 1951 paper, which is shown here to result from a procedure in which A and 
B each search for a predetermined number of items. Previous estimates using this distribution 
have been derived using a combination of conjecture and approximations. Chapman's conjectured 
estimate is found (under suitable assumptions) to be an approximation to the most probable value 
of $N$. This provides a good approximation to (19) if $N_{AB}$ is large and both searchers individually find 
the majority of the items searched for, but is increasingly bad if either searcher finds substantially 
more (or fewer) items than their partner, which can often be the case. 

For many cases such as the literature search application, all items are searched for by both 
A and B, which leads to a modified probability distribution. If a constant prior is assumed 
then the moments of this modified distribution can be closely approximated provided the number 
of items found ($N_f$) is much greater than one, which will very often be the case. When this is the 
case, an excellent approximation for the number of items missed is given by (23). Alternatively, if 
there is a prior reason to think $N \gg 1$, then it is reasonable to use the almost constant prior 
$P(N) = k(N+1)/(N+2)$, and the calculation for the estimates of (23) and (24) becomes exact. For 
estimates arising from this search procedure, there is a smaller difference between them and 
Chapman's estimate (which we have shown here does not apply, and in principle should not be 
used), but it can still be substantial. We recommend using the improved estimates given by (19), 
(20), (23), and (24), as is appropriate to the search procedure. 

The formulae apply to an enormously wide variety of problems with two independent searches 
in which the number of items found by searcher A ($N_A$), searcher B ($N_B$), and the number found 
by both ($N_{AB}$), can be determined. By "independent", we mean that A finding an item does not 
affect the probability of B finding it (e.g. for mark-and-recapture, animals do not become "shy" 
or "tame" after handling). Finally we caution against an assumption used in the calculation: 
that all objects searched for are equally likely to be found. This will fail if there is a sub-population 
that is much more difficult to find, for which case both searchers will appear to have found the 
majority of items and will over-estimate the accuracy of their search. These issues are beyond 
the intended scope of this paper. Nonetheless even when the assumption is only approximately 
true (often the assumption will be good), these improved estimates (19), (20), (23), and (24) 
will hopefully provide a valuable standard tool for literature searches and more generally. 
A The moments 

Here we briefly present our original derivation of the moments of (10), which uses simpler 
mathematical concepts, but is less conventional and systematic than the generating function approach 
presented in the main text. Repeating (10) here for convenience, with, 

\[ P(X|X_A,X_B,N_f) = K\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f)!} \tag{30} \]

and $X$ between $0$ and $\infty$. Next define, 

\[ S(X_A,X_B,N_f) = \sum_{X=0}^{\infty} \frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f)!} \tag{31} \]

where we note that $N_f = N_{AB} + X_A + X_B$, and also that $K = 1/S(X_A,X_B,N_f)$. The aim is 
to express the moments in terms of the function $S(X_A,X_B,N_f)$, evaluate $S(X_A,X_B,N_f)$ using 
an identity due to Gauss, then combine the results to obtain explicit expressions for the moments 
in terms of $X_A$, $X_B$, and $N_f$. 
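Starting with $\langle X \rangle$, notice that, 

\[
\begin{aligned}
\sum_{X=0}^{\infty} X\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f)!}
 &= \sum_{X=1}^{\infty} \frac{X}{X!}\,\frac{(X-1+X_A+1)!\,(X-1+X_B+1)!}{(X-1+N_f+1)!} \\
 &= \sum_{X=1}^{\infty} \frac{1}{(X-1)!}\,\frac{((X-1)+(X_A+1))!\,((X-1)+(X_B+1))!}{((X-1)+(N_f+1))!} \\
 &= \sum_{X=0}^{\infty} \frac{(X+X_A+1)!\,(X+X_B+1)!}{X!\,(X+N_f+1)!} \\
 &= S(X_A+1,X_B+1,N_f+1)
\end{aligned}
\tag{32}
\]

Hence, 

\[ \langle X \rangle = \frac{S(X_A+1,X_B+1,N_f+1)}{S(X_A,X_B,N_f)} \tag{33} \]

Similarly for $\langle X^2 \rangle$, 

\[
\begin{aligned}
\sum_{X=0}^{\infty} X^2\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f)!}
 &= \sum_{X=1}^{\infty} \frac{X^2}{X!}\,\frac{(X-1+X_A+1)!\,(X-1+X_B+1)!}{(X-1+N_f+1)!} \\
 &= \sum_{X=1}^{\infty} \frac{((X-1)+1)}{(X-1)!}\,\frac{((X-1)+(X_A+1))!\,((X-1)+(X_B+1))!}{((X-1)+(N_f+1))!} \\
 &= \sum_{X=0}^{\infty} (X+1)\,\frac{(X+X_A+1)!\,(X+X_B+1)!}{X!\,(X+N_f+1)!}
\end{aligned}
\tag{34}
\]

Repeating the same trick to remove the factor of $X$ then gives, 

\[ \langle X^2 \rangle = \frac{S(X_A+2,X_B+2,N_f+2)}{S(X_A,X_B,N_f)} + \frac{S(X_A+1,X_B+1,N_f+1)}{S(X_A,X_B,N_f)} \tag{35} \]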
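Similarly but with more algebra for the higher order moments, e.g. 

\[ \langle X^3 \rangle = \frac{S(X_A+3,X_B+3,N_f+3)}{S(X_A,X_B,N_f)} + 3\,\frac{S(X_A+2,X_B+2,N_f+2)}{S(X_A,X_B,N_f)} + \frac{S(X_A+1,X_B+1,N_f+1)}{S(X_A,X_B,N_f)} \tag{36} \]

and, 

\[
\begin{aligned}
\langle X^4 \rangle ={}& \frac{S(X_A+4,X_B+4,N_f+4)}{S(X_A,X_B,N_f)} + 6\,\frac{S(X_A+3,X_B+3,N_f+3)}{S(X_A,X_B,N_f)} \\
 &+ 7\,\frac{S(X_A+2,X_B+2,N_f+2)}{S(X_A,X_B,N_f)} + \frac{S(X_A+1,X_B+1,N_f+1)}{S(X_A,X_B,N_f)}
\end{aligned}
\tag{37}
\]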

To evaluate $S(X_A,X_B,N_f)$, we firstly note that the hypergeometric function has for $|z| < 1$ 
and $c \neq 0, -1, -2, \ldots$ (Arfken 1985), 

\[ {}_2F_1(a+1, b+1, c+1, z) = \frac{c!}{a!\,b!} \sum_{n=0}^{\infty} \frac{(a+n)!\,(b+n)!}{(c+n)!}\,\frac{z^n}{n!} \tag{38} \]

For $z = 1$ an identity due to Gauss gives (Arfken 1985), 

\[ {}_2F_1(a+1, b+1, c+1, 1) = \frac{\Gamma(c+1)\,\Gamma(c-a-b-1)}{\Gamma(c-a)\,\Gamma(c-b)}, \quad \mathrm{Re}(c) > \mathrm{Re}(a+b) + 1 \tag{39} \]

with $c \neq 0, -1, -2, \ldots$, as above. Equations (38) and (39) may be combined to give (for 
$z = 1$), 

\[ \sum_{n=0}^{\infty} \frac{(a+n)!\,(b+n)!}{(c+n)!}\,\frac{1}{n!} = \frac{a!\,b!}{c!}\,\frac{\Gamma(c+1)\,\Gamma(c-a-b-1)}{\Gamma(c-a)\,\Gamma(c-b)}, \quad \mathrm{Re}(c) > \mathrm{Re}(a+b) + 1 \tag{40} \]

Therefore with the replacements of $n = X$, $c = N_f$, $a = X_A$, and $b = X_B$ (so that $c = 
a + b + N_{AB} > (a+b) + 1$ for $N_{AB} > 1$), we get, 

\[ S(X_A,X_B,N_f) = \frac{X_A!\,X_B!\,(N_f - X_A - X_B - 2)!}{(N_f - X_A - 1)!\,(N_f - X_B - 1)!}, \quad N_{AB} > 1 \tag{41} \]

Hence substituting into (33) gives, 

\[ \langle X \rangle = \frac{(X_A+1)(X_B+1)}{(N_f - X_A - X_B - 2)} = \frac{(X_A+1)(X_B+1)}{(N_{AB} - 2)}, \quad N_{AB} > 2 \tag{42} \]

where the inequality follows from the requirement that $N_f + 1 > (X_A+1) + (X_B+1) + 1$ with 
$N_f = N_{AB} + X_A + X_B$. Similarly, 

\[ \langle X^2 \rangle = \frac{(X_A+1)(X_A+2)(X_B+1)(X_B+2)}{(N_{AB}-2)(N_{AB}-3)} + \frac{(X_A+1)(X_B+1)}{(N_{AB}-2)}, \quad N_{AB} > 3 \tag{43} \]

\[
\begin{aligned}
\langle X^3 \rangle ={}& \frac{(X_A+1)(X_A+2)(X_A+3)(X_B+1)(X_B+2)(X_B+3)}{(N_{AB}-2)(N_{AB}-3)(N_{AB}-4)} \\
 &+ 3\,\frac{(X_A+1)(X_A+2)(X_B+1)(X_B+2)}{(N_{AB}-2)(N_{AB}-3)} + \frac{(X_A+1)(X_B+1)}{(N_{AB}-2)}, \quad N_{AB} > 4
\end{aligned}
\tag{44}
\]

\[
\begin{aligned}
\langle X^4 \rangle ={}& \frac{(X_A+1)(X_A+2)(X_A+3)(X_A+4)(X_B+1)(X_B+2)(X_B+3)(X_B+4)}{(N_{AB}-2)(N_{AB}-3)(N_{AB}-4)(N_{AB}-5)} \\
 &+ 6\,\frac{(X_A+1)(X_A+2)(X_A+3)(X_B+1)(X_B+2)(X_B+3)}{(N_{AB}-2)(N_{AB}-3)(N_{AB}-4)} \\
 &+ 7\,\frac{(X_A+1)(X_A+2)(X_B+1)(X_B+2)}{(N_{AB}-2)(N_{AB}-3)} + \frac{(X_A+1)(X_B+1)}{(N_{AB}-2)}, \quad \text{with } N_{AB} > 5
\end{aligned}
\tag{45}
\]
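
The closed form (41) and the resulting mean (42) can be checked numerically against a direct truncation of the sum (31). The sketch below is illustrative only: the function names, the truncation at 5000 terms, and the example values $X_A = 3$, $X_B = 4$, $N_{AB} = 6$ are assumptions made here. Log-gamma functions are used so that the factorials do not overflow.

```python
import math


def s_direct(x_a, x_b, n_f, terms=5000):
    """Truncated version of the sum (31); `terms` is an assumed cut-off."""
    log_terms = (
        math.lgamma(x + x_a + 1) + math.lgamma(x + x_b + 1)
        - math.lgamma(x + 1) - math.lgamma(x + n_f + 1)
        for x in range(terms)
    )
    return math.fsum(math.exp(value) for value in log_terms)


def s_closed(x_a, x_b, n_f):
    """Closed form (41): X_A! X_B! (N_AB - 2)! / ((N_f - X_A - 1)! (N_f - X_B - 1)!)."""
    n_ab = n_f - x_a - x_b
    if n_ab <= 1:
        raise ValueError("(41) requires N_AB > 1")
    return math.exp(
        math.lgamma(x_a + 1) + math.lgamma(x_b + 1) + math.lgamma(n_ab - 1)
        - math.lgamma(n_f - x_a) - math.lgamma(n_f - x_b)
    )


if __name__ == "__main__":
    x_a, x_b, n_ab = 3, 4, 6
    n_f = n_ab + x_a + x_b
    print(s_direct(x_a, x_b, n_f), s_closed(x_a, x_b, n_f))
    # Mean from (33), to be compared with (X_A + 1)(X_B + 1)/(N_AB - 2) in (42).
    print(s_closed(x_a + 1, x_b + 1, n_f + 1) / s_closed(x_a, x_b, n_f))
```

The terms of (31) decay roughly as $X^{-N_{AB}}$ for large $X$, so the truncation is only adequate when $N_{AB}$ is not too small; for small $N_{AB}$ many more terms would be needed.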

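These may be used to calculate various statistical quantities. The standard deviation $\sigma$ is given by 
$\sigma^2 = \langle X^2 \rangle - \langle X \rangle^2$, which using (42) and (43), simplifies to give, 

\[ \sigma^2 = \frac{(X_A+1)(X_B+1)(N_{AB}+X_A-1)(N_{AB}+X_B-1)}{(N_{AB}-2)^2 (N_{AB}-3)}, \quad N_{AB} > 3 \tag{46} \]

The skewness $\gamma = \langle (X - \langle X \rangle)^3 \rangle / (\sigma^2)^{3/2}$, which expands to give, 

\[ \gamma = \frac{\langle X^3 \rangle - 3\langle X^2 \rangle \langle X \rangle + 2\langle X \rangle^3}{(\sigma^2)^{3/2}} \tag{47} \]

and may be evaluated using (42) to (44). The kurtosis is given by $\kappa = \langle (X - \langle X \rangle)^4 \rangle / (\sigma^2)^2$, 
which expands to give, 

\[ \kappa = \frac{\langle X^4 \rangle - 4\langle X^3 \rangle \langle X \rangle + 6\langle X^2 \rangle \langle X \rangle^2 - 3\langle X \rangle^4}{(\sigma^2)^2} \tag{48} \]

and may be evaluated using (42) to (45). Replacing $X = N - N_f$, $X_A = N_A - N_{AB}$, and 
$X_B = N_B - N_{AB}$ in (42) and (46), gives (19) and (20) of the main text. 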
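
The steps from (33) to (48) can be sketched in exact rational arithmetic. This is a minimal illustration under stated assumptions: the names `s_ratio`, `raw_moment` and `summary` are hypothetical, the ratios $S(X_A+k,X_B+k,N_f+k)/S(X_A,X_B,N_f)$ are evaluated from (41), and the skewness and kurtosis are normalised by powers of $\sigma^2$.

```python
from fractions import Fraction

# Coefficients of S_1/S, ..., S_p/S in <X^p>, read off from (33), (35), (36), (37),
# where S_k denotes S(X_A + k, X_B + k, N_f + k).
COEFFICIENTS = {1: (1,), 2: (1, 1), 3: (1, 3, 1), 4: (1, 7, 6, 1)}


def s_ratio(k, x_a, x_b, n_ab):
    """S_k / S from (41): product over j = 1..k of (X_A+j)(X_B+j)/(N_AB-1-j)."""
    if n_ab <= k + 1:
        raise ValueError("requires N_AB > k + 1")
    ratio = Fraction(1)
    for j in range(1, k + 1):
        ratio *= Fraction((x_a + j) * (x_b + j), n_ab - 1 - j)
    return ratio


def raw_moment(p, x_a, x_b, n_ab):
    """<X^p> for p = 1..4, equivalent to (42) to (45)."""
    return sum(c * s_ratio(k, x_a, x_b, n_ab)
               for k, c in enumerate(COEFFICIENTS[p], start=1))


def summary(x_a, x_b, n_ab):
    """Mean, variance (exact), skewness and kurtosis (floats) of X."""
    m1, m2, m3, m4 = (raw_moment(p, x_a, x_b, n_ab) for p in (1, 2, 3, 4))
    variance = m2 - m1 ** 2
    third = m3 - 3 * m2 * m1 + 2 * m1 ** 3                      # numerator of (47)
    fourth = m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4  # numerator of (48)
    skewness = float(third) / float(variance) ** 1.5
    kurtosis = float(fourth) / float(variance) ** 2
    return m1, variance, skewness, kurtosis


if __name__ == "__main__":
    # Illustrative counts: N_A = 30, N_B = 20, N_AB = 10, so X_A = 20 and X_B = 10.
    print(summary(20, 10, 10))
```

Using `Fraction` keeps the mean and variance exact, so the simplification leading to (46) can be confirmed without rounding error.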



B Poisson approximation 

Starting from (11), use the approach of Chapman (1951) and García-Pelayo (2006) to find $X^*$ for 
which $P(X|X_A,X_B,N_{AB})$ is maximum, from $P(X^*|X_A,X_B,N_{AB}) = P(X^*-1|X_A,X_B,N_{AB})$. 
This gives $X^* = X_A X_B/N_f = (N_A - N_{AB})(N_B - N_{AB})/N_f$. When both $X_A/N_{AB} \ll 1$ and 
$X_B/N_{AB} \ll 1$, then both $X^* \ll X_A$ and $X^* \ll X_B$, and because $N_f = N_{AB} + X_A + X_B$ is 
larger than either $X_A$ or $X_B$ then $X^* \ll N_f$ also. 

Next note that $(X+X_A)! = X_A!\,X_A^X \exp\{\sum_{y=1}^{X} \log(1 + y/X_A)\}$, as may be seen from 
expanding $(X+X_A)!$, 

\[
\begin{aligned}
(X+X_A)! &= (X_A+X)(X_A+X-1)\cdots(X_A+1)\,X_A! \\
 &= X_A! \exp\Big\{\sum_{y=1}^{X} \log(y + X_A)\Big\}
\end{aligned}
\tag{49}
\]

where the last line repeatedly used $AB = \exp(\log(AB)) = \exp(\log(A) + \log(B))$. Then write, 

\[
\begin{aligned}
X_A! \exp\Big\{\sum_{y=1}^{X} \log(y + X_A)\Big\}
 &= X_A! \exp\Big\{\sum_{y=1}^{X} \log\big(X_A(1 + y/X_A)\big)\Big\} \\
 &= X_A! \exp\Big\{X\log(X_A) + \sum_{y=1}^{X} \log(1 + y/X_A)\Big\} \\
 &= X_A! \exp\big\{\log(X_A^X)\big\} \exp\Big\{\sum_{y=1}^{X} \log(1 + y/X_A)\Big\} \\
 &= X_A!\,X_A^X \exp\Big\{\sum_{y=1}^{X} \log(1 + y/X_A)\Big\}
\end{aligned}
\tag{50}
\]

as originally stated. Similarly expanding $(X+X_B)!$ and $(X+N_f)!$, gives, 

\[
P(X|X_A,X_B,N_f) = K\,\frac{X_A!\,X_B!}{N_f!}\,\frac{(X^*)^X}{X!} \times
\exp\Big\{\sum_{i=1}^{X} \log(1 + i/X_A) + \sum_{j=1}^{X} \log(1 + j/X_B) - \sum_{k=1}^{X} \log(1 + k/N_f)\Big\}
\tag{51}
\]

The above expression is exact, and can be used as the starting point for a variety of 
approximations. It is composed of the product of a Poisson distribution $(X^*)^X/X!$ with $X^* = X_A X_B/N_f$, a 
constant term that ensures (51) is correctly normalised, and an exponential term whose exponent 
is a function of $X$. As $X$ becomes small relative to $X_A$, $X_B$, and $N_f$, the exponential's exponent 
tends to zero, and (51) asymptotes to a Poisson distribution. However, because $X_A < N_f$ and 
$X_B < N_f$, the exponential term's exponent is a strictly increasing function of $X$. Consequently a 
good approximation to (51) by a Poisson distribution is only ever possible over a limited range of 
$X$. An approximation with a Poisson distribution to (51) can be found by approximating the 
exponential term in (51) near $X = X^*$. The rate of change of the exponential's exponent near $X = X^*$ 
can be estimated by considering the difference in its value between $X^*$ and $(X^* - 1)$, which is 
simply $\log[(1 + X^*/X_A)(1 + X^*/X_B)/(1 + X^*/N_f)]$. Provided this rate of change is small, then a 
Poisson distribution will provide a good approximation near the maximum of (51). If $X^*/X_A \ll 1$ 
and $X^*/X_B \ll 1$ (implying $X^*/N_f \ll 1$), then $\log[(1 + X^*/X_A)(1 + X^*/X_B)/(1 + X^*/N_f)]$ will 
be small, and the exponent will be approximately constant near $X^*$. Therefore if $X^*/X_A \ll 1$ 
and $X^*/X_B \ll 1$, the Poisson distribution provides a good approximation near the maximum of 
(51). If a precise and accurate approximation for the moments of (51) only requires a sufficiently 
precise approximation to (51) near $X = X^*$ (we do not claim to show this here), then the 
Poisson distribution will provide a good approximation for the moments of (51). These remarks 
are consistent with the observations in the main text that: (19) is always greater than (21), 
but provided that $X^*/X_A \ll 1$ and $X^*/X_B \ll 1$ (implying $X^*/N_f \ll 1$), the exact and 
approximated moments are approximately the same (for a Poisson distribution, $\langle X \rangle = X^*$ 
and $\sigma^2 = X^*$, e.g. see Stirzaker (1994)). The above calculation easily generalises to the case 
studied by García-Pelayo (2006) with $n$ persons searching; consequently similar remarks apply to 
that problem also. 
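
The statement that the factorisation (51) is exact can be checked numerically. The following sketch is illustrative only: the function names and the example values ($X_A = 50$, $X_B = 40$, $N_f = 2090$) are assumptions, and the normalisation constant $K$ is omitted from both sides. It compares the logarithm of the unnormalised weight in (30) with the logarithm of the factorised form, and exposes the exponent of the exponential term so its growth with $X$ can be inspected.

```python
import math


def log_weight_direct(x, x_a, x_b, n_f):
    """log of (X+X_A)!(X+X_B)!/(X!(X+N_f)!), the unnormalised form of (30)."""
    return (math.lgamma(x + x_a + 1) + math.lgamma(x + x_b + 1)
            - math.lgamma(x + 1) - math.lgamma(x + n_f + 1))


def correction_exponent(x, x_a, x_b, n_f):
    """Exponent of the exponential term in (51); requires X_A, X_B > 0."""
    return sum(math.log1p(y / x_a) + math.log1p(y / x_b) - math.log1p(y / n_f)
               for y in range(1, x + 1))


def log_weight_factorised(x, x_a, x_b, n_f):
    """log of the right-hand side of (51) without the constant K."""
    x_star = x_a * x_b / n_f
    log_constant = math.lgamma(x_a + 1) + math.lgamma(x_b + 1) - math.lgamma(n_f + 1)
    log_poisson_part = x * math.log(x_star) - math.lgamma(x + 1)
    return log_constant + log_poisson_part + correction_exponent(x, x_a, x_b, n_f)


if __name__ == "__main__":
    x_a, x_b, n_f = 50, 40, 2090
    for x in range(6):
        print(x,
              log_weight_direct(x, x_a, x_b, n_f),
              log_weight_factorised(x, x_a, x_b, n_f),
              correction_exponent(x, x_a, x_b, n_f))
```

The printed exponent grows with $X$, which illustrates why a Poisson approximation can only be accurate over a limited range of $X$.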



C Relation between the distribution for a search for all items and (10) 

Here the relationship between (10) and the modified distribution that results when all items are 
searched for by both A and B is discussed. Firstly write the modified distribution in terms of 
$X = N - N_f$, $X_A = N_A - N_{AB}$, $X_B = N_B - N_{AB}$, and $N_f = N_{AB} + X_A + X_B$, to give, 

\[ P(X|X_A,X_B,N_f) = K\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f)!}\,\frac{P(X+N_f)}{(X+N_f+1)^2} \tag{52} \]

Throughout this section we will only consider the case where $P(X+N_f) = P(N)$ is constant. 
The moments of (52) are then, 

\[
\langle X^p \rangle =
\frac{\sum_{X=0}^{\infty} \frac{X^p}{(X+N_f+1)}\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+1)!}}
     {\sum_{X=0}^{\infty} \frac{1}{(X+N_f+1)}\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+1)!}}
\tag{53}
\]

where one of the factors of $1/(X+N_f+1)$ has been incorporated into $1/(X+N_f+1)!$. If 
$N_f \gg 1$, then a good approximation can be made by replacing the extra factor of $1/(X+N_f+1)$ 
by $1/(X+N_f+2)$, giving, 

\[
\langle X^p \rangle \simeq
\frac{\sum_{X=0}^{\infty} X^p\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}}
     {\sum_{X=0}^{\infty} \frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}}
\equiv \langle X^p \rangle_0[N_f+2]
\tag{54}
\]

where $\langle X^p \rangle_0[N_f+2]$ refers to moments of (30), but with $N_f$ replaced by $N_f+2$, or equivalently 
(noting that $N_f = X_A + X_B + N_{AB}$), by replacing $N_{AB}$ by $N_{AB}+2$, keeping $X_A$ and $X_B$ fixed 
everywhere else. Consequently the moments $\langle X^p \rangle_0[N_f+2]$ are found by replacing $N_{AB}$ with 
$N_{AB}+2$ in (42) - (45), keeping $X_A$ and $X_B$ fixed. Next we calculate bounds on the difference 
between $\langle X^p \rangle$ and its estimation by $\langle X^p \rangle_0[N_f+2]$. 

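Consider firstly an upper bound, 

\[
\langle X^p \rangle - \langle X^p \rangle_0[N_f+2] =
\frac{\sum_{X=0}^{\infty} \frac{X^p}{(X+N_f+1)}\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+1)!}}
     {\sum_{X=0}^{\infty} \frac{1}{(X+N_f+1)}\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+1)!}}
-
\frac{\sum_{X=0}^{\infty} X^p\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}}
     {\sum_{X=0}^{\infty} \frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}}
\tag{55}
\]

Noting that $1/(X+N_f+2) < 1/(X+N_f+1)$, and using this in the denominator of the first 
term, gives the bound, 

\[
\langle X^p \rangle - \langle X^p \rangle_0[N_f+2] <
\frac{\sum_{X=0}^{\infty} \frac{X^p}{(X+N_f+1)}\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+1)!}
      - \sum_{X=0}^{\infty} X^p\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}}
     {\sum_{X=0}^{\infty} \frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}}
\tag{56}
\]

Expanding $1/(X+N_f+2)! = (1/(X+N_f+2))(1/(X+N_f+1)!)$, gives, 

\[
\langle X^p \rangle - \langle X^p \rangle_0[N_f+2] <
\frac{\sum_{X=0}^{\infty} X^p\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+1)!}
      \left[\frac{1}{X+N_f+1} - \frac{1}{X+N_f+2}\right]}
     {\sum_{X=0}^{\infty} \frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}}
\tag{57}
\]

This simplifies to, 

\[
\langle X^p \rangle - \langle X^p \rangle_0[N_f+2] <
\frac{\sum_{X=0}^{\infty} X^p\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+1)!}\,\frac{1}{(X+N_f+1)(X+N_f+2)}}
     {\sum_{X=0}^{\infty} \frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}}
\tag{58}
\]

Finally using $(X+N_f+2)! = (X+N_f+1)!\,(X+N_f+2)$ and $1/(X+N_f+1) \leq 1/(N_f+1)$, 
we have the simplified upper bound of, 

\[ \langle X^p \rangle - \langle X^p \rangle_0[N_f+2] < \frac{\langle X^p \rangle_0[N_f+2]}{N_f+1} \tag{59} \]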
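Next consider a lower bound, again starting from, 

\[
\langle X^p \rangle - \langle X^p \rangle_0[N_f+2] =
\frac{\sum_{X=0}^{\infty} \frac{X^p}{(X+N_f+1)}\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+1)!}}
     {\sum_{X=0}^{\infty} \frac{1}{(X+N_f+1)}\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+1)!}}
-
\frac{\sum_{X=0}^{\infty} X^p\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}}
     {\sum_{X=0}^{\infty} \frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}}
\tag{60}
\]

This time using $1/(X+N_f+1) > 1/(X+N_f+2)$ in the numerator of the first term, gives the 
bound, 

\[
\langle X^p \rangle - \langle X^p \rangle_0[N_f+2] >
\left(\sum_{X=0}^{\infty} X^p\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}\right) \times
\left[\frac{1}{\sum_{X=0}^{\infty} \frac{1}{(X+N_f+1)}\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+1)!}}
      - \frac{1}{\sum_{X=0}^{\infty} \frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}}\right]
\tag{61}
\]

Factorising the right hand side of the above expression gives, 

\[
\langle X^p \rangle - \langle X^p \rangle_0[N_f+2] >
\left(\sum_{X=0}^{\infty} X^p\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}\right) \times
\frac{\sum_{X=0}^{\infty} \frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}
      - \sum_{X=0}^{\infty} \frac{1}{(X+N_f+1)}\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+1)!}}
     {\left(\sum_{X=0}^{\infty} \frac{1}{(X+N_f+1)}\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+1)!}\right)
      \left(\sum_{X=0}^{\infty} \frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}\right)}
\tag{62}
\]

Writing $1/(X+N_f+2)! = 1/[(X+N_f+1)!\,(X+N_f+2)]$, this becomes, 

\[
\langle X^p \rangle - \langle X^p \rangle_0[N_f+2] >
\left(\sum_{X=0}^{\infty} X^p\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}\right) \times
\frac{\sum_{X=0}^{\infty} \frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+1)!}
      \left[\frac{1}{X+N_f+2} - \frac{1}{X+N_f+1}\right]}
     {\left(\sum_{X=0}^{\infty} \frac{1}{(X+N_f+1)}\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+1)!}\right)
      \left(\sum_{X=0}^{\infty} \frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}\right)}
\tag{63}
\]

That simplifies to, 

\[
\langle X^p \rangle - \langle X^p \rangle_0[N_f+2] >
\left(\sum_{X=0}^{\infty} X^p\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}\right) \times
\frac{\sum_{X=0}^{\infty} \frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+1)!}
      \left[\frac{-1}{(X+N_f+2)(X+N_f+1)}\right]}
     {\left(\sum_{X=0}^{\infty} \frac{1}{(X+N_f+1)}\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+1)!}\right)
      \left(\sum_{X=0}^{\infty} \frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}\right)}
\tag{64}
\]

Now note that $-1/(X+N_f+2) \geq -1/(N_f+2)$, which gives 

\[
\langle X^p \rangle - \langle X^p \rangle_0[N_f+2] >
\left(\sum_{X=0}^{\infty} X^p\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}\right) \times
\frac{\frac{-1}{(N_f+2)} \sum_{X=0}^{\infty} \frac{1}{(X+N_f+1)}\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+1)!}}
     {\left(\sum_{X=0}^{\infty} \frac{1}{(X+N_f+1)}\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+1)!}\right)
      \left(\sum_{X=0}^{\infty} \frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}\right)}
\tag{65}
\]

Cancelling terms this leaves, 

\[
\langle X^p \rangle - \langle X^p \rangle_0[N_f+2] >
\left(\frac{-1}{N_f+2}\right)
\frac{\sum_{X=0}^{\infty} X^p\,\frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}}
     {\sum_{X=0}^{\infty} \frac{(X+X_A)!\,(X+X_B)!}{X!\,(X+N_f+2)!}}
\tag{66}
\]

which is simply, 

\[ \langle X^p \rangle - \langle X^p \rangle_0[N_f+2] > -\frac{\langle X^p \rangle_0[N_f+2]}{N_f+2} \tag{67} \]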
Combining (59) with (67), we have 

\[ -\frac{\langle X^p \rangle_0[N_f+2]}{N_f+2} < \langle X^p \rangle - \langle X^p \rangle_0[N_f+2] < \frac{\langle X^p \rangle_0[N_f+2]}{N_f+1} \tag{68} \]

Which gives, 

\[ \langle X^p \rangle = \langle X^p \rangle_0[N_f+2] \left(1 \pm \frac{1}{N_f+1}\right) \tag{69} \]

where the factor of $\pm 1/(N_f+1)$ gives a maximum error bound. Consequently, if $N_f \gg 1$ then 
an excellent approximation to $\langle X^p \rangle$ that is correct to within $\pm 100/(N_f+1)$ percent, is given by 
$\langle X^p \rangle_0[N_f+2]$. This approximation for the moments of (52) is equal to the exact moments of 
(30) with $N_{AB}$ replaced by $N_{AB}+2$, keeping $X_A$ and $X_B$ fixed. Consequently using (42) we 
have, 

\[ \langle X \rangle \simeq \frac{(X_A+1)(X_B+1)}{N_{AB}} \quad \text{with } (N_f - X_A - X_B) > 0 \tag{70} \]

with a maximum error of $\pm \langle X \rangle/(N_f+1)$, which with the substitutions $X_A = N_A - N_{AB}$, 
$X_B = N_B - N_{AB}$, and $N_f = N_A + N_B - N_{AB}$ is (23) of the main text. Similarly using (42) and 
(46) an approximation for $\sigma^2$ is, 

\[ \sigma^2 \simeq \frac{(X_A+1)(X_B+1)(N_{AB}+X_A+1)(N_{AB}+X_B+1)}{N_{AB}^2\,(N_{AB}-1)} \quad \text{with } N_{AB} > 1 \tag{71} \]

which with the substitutions $X_A = N_A - N_{AB}$, $X_B = N_B - N_{AB}$, and $N_f = N_A + N_B - N_{AB}$ 
is (24) of the main text. Unfortunately whereas (70) has a maximum error of order $\langle X \rangle/N_f$, 
which is much less than $\langle X \rangle$ if $N_f \gg 1$, $\sigma^2 = \langle X^2 \rangle - \langle X \rangle^2$ can be arbitrarily small, but the 
maximum possible error remains of order $\langle X^2 \rangle/N_f$. Therefore unless $\langle X^2 \rangle/N_f \ll 1$, (71) will not 
be guaranteed to give a good approximation for $\sigma^2$. As is noted in the main text, an alternative 
approach is to use the prior $P(N) = k(N+1)/(N+2)$, for which (70) and (71) are the exactly 
calculated moments. For that case this calculation gives the maximum difference between the 
moments with this, and with a prior that is independent of $N$. 
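
The error bound (68) can be illustrated numerically. The sketch below is not part of the original derivation: the function names, the truncation at 5000 terms, and the example values $X_A = 3$, $X_B = 4$, $N_{AB} = 8$ are assumptions. It evaluates the moments (53) by direct summation and compares them with $\langle X^p \rangle_0[N_f+2]$ obtained from (42) and (43) with $N_{AB}$ replaced by $N_{AB}+2$.

```python
import math


def moment_constant_prior(p, x_a, x_b, n_f, terms=5000):
    """<X^p> from (53) by truncated summation; `terms` is an assumed cut-off."""
    numerator, denominator = [], []
    for x in range(terms):
        log_weight = (math.lgamma(x + x_a + 1) + math.lgamma(x + x_b + 1)
                      - math.lgamma(x + 1) - math.lgamma(x + n_f + 2)
                      - math.log(x + n_f + 1))
        weight = math.exp(log_weight)
        numerator.append(x ** p * weight)
        denominator.append(weight)
    return math.fsum(numerator) / math.fsum(denominator)


def moment_approx(p, x_a, x_b, n_ab):
    """<X^p>_0[N_f + 2] for p = 1 or 2: (42), (43) with N_AB -> N_AB + 2."""
    first = (x_a + 1) * (x_b + 1) / n_ab
    if p == 1:
        return first
    if p == 2:
        return (x_a + 1) * (x_a + 2) * (x_b + 1) * (x_b + 2) / (n_ab * (n_ab - 1)) + first
    raise ValueError("only p = 1 and p = 2 are implemented in this sketch")


if __name__ == "__main__":
    x_a, x_b, n_ab = 3, 4, 8
    n_f = n_ab + x_a + x_b
    for p in (1, 2):
        approx = moment_approx(p, x_a, x_b, n_ab)
        exact = moment_constant_prior(p, x_a, x_b, n_f)
        print(p, exact, approx, -approx / (n_f + 2), exact - approx, approx / (n_f + 1))
```

For these illustrative values the difference lies inside the interval given by (68), as the derivation requires.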